Linear regression is a fundamental statistical method used to model the relationship between a dependent variable and one or more independent variables. It is widely used in various fields such as economics, biology, engineering, and social sciences.

1 Types of Linear Regression

There are several types of linear regression, depending on the number of variables and the nature of the relationship:

1.1 Simple Linear Regression

Simple linear regression involves one independent variable and one dependent variable. The model is expressed as:

\[ Y = \beta_0 + \beta_1 X + \epsilon \]

Where:

  • \(Y\) : dependent variable you are trying to predict
  • \(X\) : independent variable (the single predictor)
  • \(\beta_0\) : intercept (expected value of \(Y\) when \(X\) is zero)
  • \(\beta_1\) : slope coefficient for \(X\)
  • \(\epsilon\) : error term capturing variation in \(Y\) not explained by the model
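As a minimal sketch (simulated data, with arbitrary true coefficients \(\beta_0 = 3\) and \(\beta_1 = 1.5\)), a simple linear regression can be fit in R with `lm()`:

```r
set.seed(42)

# Simulate data from Y = 3 + 1.5 * X + noise
X <- runif(50, 0, 10)
Y <- 3 + 1.5 * X + rnorm(50, sd = 1)

# Fit the simple linear regression Y ~ X
fit <- lm(Y ~ X)
coef(fit)  # estimates of beta_0 (intercept) and beta_1 (slope)
```

The fitted coefficients should land close to the true values 3 and 1.5 used in the simulation.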

1.2 Multiple Linear Regression

Multiple linear regression includes two or more independent variables:

\[ Y = \beta_0 + \beta_1 X_1 + \beta_2 X_2 + \dots + \beta_p X_p + \gamma_1 Z_1 + \gamma_2 Z_2 + \dots + \gamma_q Z_q + \epsilon \]

Where:

  • \(Y\) : dependent (phenotype) variable you are trying to predict
  • \(X_1, X_2, \dots, X_p\) : main independent variables of interest
  • \(\beta_0\) : intercept (expected value of \(Y\) when all predictors are zero)
  • \(\beta_1, \dots, \beta_p\) : coefficients for main predictors \(X_1 \dots X_p\)
  • \(Z_1, \dots, Z_q\) : covariates (additional variables you adjust for, e.g., age, sex, PCs in genetics)
  • \(\gamma_1, \dots, \gamma_q\) : coefficients for covariates
  • \(\epsilon\) : error term capturing variation in \(Y\) not explained by the model

The most common estimation method for multiple linear regression is Ordinary Least Squares (OLS), which is used to estimate the parameters (coefficients) of a linear regression model. OLS finds the line (or hyperplane) that minimizes the sum of the squared differences between the observed values and the predicted values.
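This minimization property can be checked directly on simulated data (arbitrary coefficients): the coefficients returned by `lm()` give a smaller residual sum of squares than any perturbed alternative:

```r
set.seed(1)
X1 <- rnorm(100); X2 <- rnorm(100)
Y  <- 1 + 2 * X1 - 0.5 * X2 + rnorm(100)

fit <- lm(Y ~ X1 + X2)

# Residual sum of squares for a candidate coefficient vector (b0, b1, b2)
rss <- function(b) sum((Y - (b[1] + b[2] * X1 + b[3] * X2))^2)

rss(coef(fit))                 # RSS at the OLS solution
rss(coef(fit) + c(0.1, 0, 0))  # any perturbation increases the RSS
```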

1.3 Ridge Regression

Ridge regression is a type of regularized regression that adds a penalty term to the loss function:

\[ \text{Loss} = \sum (y_i - \hat{y}_i)^2 + \lambda \sum \beta_j^2 \]

This helps reduce overfitting in the presence of multicollinearity.
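Unlike the lasso below, ridge has a closed-form solution. A minimal sketch on simulated data (arbitrary \(\lambda\); for simplicity there is no intercept to penalize):

```r
set.seed(2)
n <- 100; p <- 3
X <- matrix(rnorm(n * p), n, p)
y <- X %*% c(2, -1, 0.5) + rnorm(n)

lambda <- 1
# Closed-form ridge solution: (X'X + lambda * I)^{-1} X'y
beta_ridge <- solve(t(X) %*% X + lambda * diag(p), t(X) %*% y)
beta_ols   <- solve(t(X) %*% X, t(X) %*% y)

# The penalty shrinks the coefficient vector toward zero
sum(beta_ridge^2) < sum(beta_ols^2)
```

Note that \((X^TX + \lambda I)\) is invertible even when \(X^TX\) is singular, which is why ridge remains stable under multicollinearity.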

1.4 Lasso Regression

Lasso (Least Absolute Shrinkage and Selection Operator) also adds a penalty term but uses the absolute value of coefficients:

\[ \text{Loss} = \sum (y_i - \hat{y}_i)^2 + \lambda \sum |\beta_j| \]
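The lasso has no closed form in general, but for an orthonormal design it reduces to soft-thresholding the OLS coefficients (with threshold \(\lambda/2\) for the loss written above). This sketch shows how the absolute-value penalty sets small coefficients exactly to zero, unlike ridge, which only shrinks them:

```r
# Soft-thresholding operator: the lasso solution when X is orthonormal
soft <- function(b, thresh) sign(b) * pmax(abs(b) - thresh, 0)

beta_ols <- c(3, -0.4, 0.05, 1.2)
soft(beta_ols, thresh = 0.25)  # lambda = 0.5
# Small coefficients become exactly zero; large ones are shrunk by 0.25
```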

1.5 How to compute p-values from linear regression

To compute a p-value for each coefficient in a linear regression model, we:

  1. Estimate the coefficient \(\hat{\beta}_i\) using OLS.
  2. Estimate the standard error of \(\hat{\beta}_i\).
  3. Compute the t-statistic:

\[ t_i = \frac{\hat{\beta}_i}{SE(\hat{\beta}_i)} \]

  4. Calculate the p-value using the t-distribution with \(n - k\) degrees of freedom, where:
    • \(n\) is the number of observations
    • \(k\) is the number of estimated parameters (including the intercept)
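The steps above can be checked against R's `summary.lm` output (simulated data, arbitrary coefficients; here \(n = 30\) and \(k = 2\)):

```r
set.seed(3)
x <- rnorm(30)
y <- 1 + 0.8 * x + rnorm(30)
fit <- lm(y ~ x)

co <- summary(fit)$coefficients
t_stat <- co["x", "Estimate"] / co["x", "Std. Error"]           # step 3
p_val  <- 2 * pt(abs(t_stat), df = 30 - 2, lower.tail = FALSE)  # step 4

# Matches the p-value reported by summary(fit)
all.equal(p_val, co["x", "Pr(>|t|)"])
```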

1.6 Why OLS is Best for Inference and p-values

OLS is the most appropriate method for computing p-values in linear regression because:

  • OLS provides unbiased and efficient estimates of the coefficients under the classical linear model assumptions (Gauss-Markov theorem).
  • The distributional results (like the t-distribution of the coefficients) are derived assuming OLS estimation.
  • OLS allows direct computation of standard errors, confidence intervals, and p-values using well-established statistical theory.
  • Alternative methods like Ridge or Lasso do not provide unbiased estimates, so their p-values (if computed at all) are not valid in the traditional frequentist sense.

1.7 Statistics Used to Compute OLS p-values

To compute p-values in OLS regression, we use the following core statistics:

  1. Estimated Coefficients \(\hat{\beta}\)
    • Obtained by minimizing the residual sum of squares:
      \[ \hat{\beta} = (X^TX)^{-1}X^Ty \]
  2. Residual Standard Error \(\hat{\sigma}\)
    • Measures the variability in residuals:
      \[ \hat{\sigma} = \sqrt{\frac{1}{n - k} \sum (y_i - \hat{y}_i)^2} \]
  3. Standard Error of Coefficients
    • Derived from the variance-covariance matrix:
      \[ SE(\hat{\beta}_j) = \sqrt{\hat{\sigma}^2 (X^TX)^{-1}_{jj}} \]
  4. t-statistic
    • Tests whether each coefficient is significantly different from zero:
      \[ t_j = \frac{\hat{\beta}_j}{SE(\hat{\beta}_j)} \]
  5. Degrees of Freedom
    • Defined as \(n - k\), where \(n\) is number of observations and \(k\) is number of parameters (including intercept).
  6. p-value
    • Computed from the t-distribution:
      \[ p\text{-value} = 2 \cdot P(T > |t_j|), \quad T \sim t_{n-k} \]
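The six statistics above can be computed directly with matrix algebra and compared with `lm()` (simulated data, arbitrary coefficients):

```r
set.seed(4)
n <- 50
x1 <- rnorm(n); x2 <- rnorm(n)
y  <- 2 + x1 - 0.5 * x2 + rnorm(n)

X <- cbind(1, x1, x2)   # design matrix with an intercept column
k <- ncol(X)

beta_hat  <- solve(t(X) %*% X, t(X) %*% y)                  # 1. coefficients
res       <- y - X %*% beta_hat
sigma_hat <- sqrt(sum(res^2) / (n - k))                     # 2. residual SE
se_beta   <- as.numeric(sigma_hat * sqrt(diag(solve(t(X) %*% X))))  # 3. SE
t_stat    <- as.numeric(beta_hat) / se_beta                 # 4. t-statistics
p_val     <- 2 * pt(abs(t_stat), df = n - k, lower.tail = FALSE)  # 5-6.

# Agrees with lm()
fit <- lm(y ~ x1 + x2)
all.equal(p_val, unname(summary(fit)$coefficients[, "Pr(>|t|)"]))
```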

A critical step in OLS is computing the inverse of the matrix \(X^TX\), which is needed to estimate coefficients and their standard errors. However, this matrix becomes singular (non-invertible) or nearly singular (ill-conditioned) when predictors are highly collinear, that is, when one or more predictors are linearly dependent or strongly correlated.

In such cases, the determinant of \(X^TX\) approaches zero, making \((X^TX)^{-1}\) either impossible to compute or extremely unstable, which in turn leads to unreliable and highly sensitive coefficient estimates.
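This can be seen numerically: with a duplicated predictor column, \(X^TX\) is exactly singular and `solve()` fails (simulated data):

```r
set.seed(5)
x1 <- rnorm(100)
X  <- cbind(1, x1, x1)  # third column duplicates the second

XtX <- t(X) %*% X
det(XtX)                # (numerically) zero: the matrix is singular

# Inversion errors out for an exactly singular system
inv <- tryCatch(solve(XtX), error = function(e) "X'X is singular")
```

For near-collinear (rather than exactly collinear) columns, `solve()` may succeed but return wildly unstable values instead of failing outright.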

1.8 Collinearity & Near-perfect collinearity case

Collinearity occurs when two or more predictor variables are highly correlated; in the Stoat setting, the regression design matrices can contain perfectly collinear columns.

# Set seed for reproducibility
set.seed(123)

# X1 can be any continuous variable
X1 <- runif(100, 0, 1)

# Make X2 perfectly collinear with X1
X2 <- X1  # now X2 = X1 exactly

# Response variable Y
Y <- 5 + 2*X1 + rnorm(100, sd = 0.1)

# Fit linear model with perfectly collinear predictors
model2 <- lm(Y ~ X1 + X2)
summary(model2)
## 
## Call:
## lm(formula = Y ~ X1 + X2)
## 
## Residuals:
##       Min        1Q    Median        3Q       Max 
## -0.223797 -0.061323 -0.001973  0.059633  0.221723 
## 
## Coefficients: (1 not defined because of singularities)
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  4.99910    0.01961  254.99   <2e-16 ***
## X1           1.99102    0.03418   58.25   <2e-16 ***
## X2                NA         NA      NA       NA    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.09693 on 98 degrees of freedom
## Multiple R-squared:  0.9719, Adjusted R-squared:  0.9716 
## F-statistic:  3393 on 1 and 98 DF,  p-value: < 2.2e-16

When predictors are perfectly or nearly perfectly collinear, the model may fail to estimate some coefficients (this can happen when there are more than two columns and at least two of them are identical):

set.seed(123)
n <- 100

# First column
X1 <- runif(n, 0, 1)

# Second column is exactly the same as the first
X2 <- X1  # perfect collinearity

# Third column ensures rows sum to 1
X3 <- 1 - (X1 + X2)

# Response variable
Y <- 5 + 2*X1 + 3*X2 + rnorm(n)

# Fit linear model
df <- data.frame(Y = Y, X1 = X1, X2 = X2, X3 = X3)
model <- lm(Y ~ ., data = df)
summary(model)
## 
## Call:
## lm(formula = Y ~ ., data = df)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -2.23797 -0.61323 -0.01973  0.59633  2.21723 
## 
## Coefficients: (2 not defined because of singularities)
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)   4.9910     0.1961   25.46   <2e-16 ***
## X1            4.9102     0.3418   14.37   <2e-16 ***
## X2                NA         NA      NA       NA    
## X3                NA         NA      NA       NA    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.9693 on 98 degrees of freedom
## Multiple R-squared:  0.678,  Adjusted R-squared:  0.6747 
## F-statistic: 206.3 on 1 and 98 DF,  p-value: < 2.2e-16

To handle this special case of collinearity, one of the strategies we will examine in this report is to remove the last column of each predictor matrix \(X\) and to merge columns that are perfectly redundant.
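A minimal sketch of this preprocessing (the helper name is illustrative, not the report's actual implementation): keep one representative of each perfectly duplicated column, then drop the final column so the remaining predictors are no longer linearly dependent:

```r
# Illustrative helper: keep one representative of each duplicated column
drop_duplicate_cols <- function(X) X[, !duplicated(t(X)), drop = FALSE]

X <- cbind(a = c(1, 2, 3), b = c(1, 2, 3), c = c(0, -1, -2))  # b duplicates a
X_clean <- drop_duplicate_cols(X)   # columns a and c remain

# Then drop the last column of the cleaned predictor matrix
X_final <- X_clean[, -ncol(X_clean), drop = FALSE]
colnames(X_final)
```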

2 Linear Regression test methods

2.1 ALL VARIANTS PLOT, SIGNIFICANT

## P-value distribution [min/max/mean] : 
##   Method Min         Mean       Max
## 1     PR   0 0.0002560279 0.1141045
## % of p-value significant per method:
##      P_R  P_Maths  P_Stoat P_RVTest 
##    99.42    99.42    99.42    99.42 
## 
## Difference in P-values per method [min/max/mean %]:
## Maths : min = 0, mean = 0, max = 0
## Stoat : min = 0, mean = 0, max = 0
## RVTest : min = 0.634, mean = 4.3553, max = 10.6625
(Figures not shown: pval_hist, pval_box, pval_hist_small, pval_diff_vs_R, pval_diff_hist, coef_hist, coef_diff_hist)

2.2 ALL VARIANTS PLOT, NOT SIGNIFICANT

## P-value distribution [min/max/mean] : 
##   Method Min       Mean       Max
## 1     PR   0 0.09983462 0.9998629
## % of p-value significant per method:
##      P_R  P_Maths  P_Stoat P_RVTest 
##    80.18    80.18    80.18    80.18 
## 
## Difference in P-values per method [min/max/mean %]:
## Maths : min = 0, mean = 0, max = 0
## Stoat : min = 0, mean = 0, max = 0
## RVTest : min = 0, mean = 0.2811, max = 2.5267
(Figures not shown: pval_hist, pval_box, pval_ecdf, pval_diff_vs_R, pval_diff_hist, coef_hist, coef_diff_hist)

2.3 NEAR-PERFECT COLLINEARITY WITHOUT MERGING IDENTICAL COLUMNS, SIGNIFICANT

## P-value distribution [min/max/mean] : 
##   Method Min         Mean       Max
## 1     PR   0 0.0002883187 0.2293766
## % of p-value significant per method:
##      P_R  P_Maths  P_Stoat P_RVTest 
##    99.40    80.00    80.04    81.68 
## 
## Difference in P-values per method [min/max/mean %]:
## Maths : min = 335.9643, mean = 566922183802.968, max = 203644960156122
## Stoat : min = 1.1374, mean = 566920288218.151, max = 203644960156122
## RVTest : min = 156.7742, mean = 121801303130.594, max = 82299951251064.7
(Figures not shown: pval_hist, pval_box, pval_hist_small, pval_diff_vs_R, pval_diff_hist, coef_hist, coef_diff_hist)

2.4 NEAR-PERFECT COLLINEARITY WITHOUT MERGING IDENTICAL COLUMNS, NOT SIGNIFICANT

## P-value distribution [min/max/mean] : 
##   Method Min      Mean       Max
## 1     PR   0 0.1014515 0.9985768
## % of p-value significant per method:
##      P_R  P_Maths  P_Stoat P_RVTest 
##    80.16    80.00    80.00    80.00 
## 
## Difference in P-values per method [min/max/mean %]:
## Maths : min = 0.1425, mean = 516.7819, max = 58755.554
## Stoat : min = 0.1425, mean = 516.7819, max = 58755.5772
## RVTest : min = 0.0785, mean = 210.663, max = 16307.0963
(Figures not shown: pval_hist, pval_box, pval_ecdf, pval_diff_vs_R, pval_diff_hist, coef_hist, coef_diff_hist)

2.5 NEAR-PERFECT COLLINEARITY WITH MERGING IDENTICAL COLUMNS, SIGNIFICANT

## P-value distribution [min/max/mean] : 
##   Method Min         Mean       Max
## 1     PR   0 0.0002073024 0.1246804
## % of p-value significant per method:
##      P_R  P_Maths  P_Stoat P_RVTest 
##    99.62    99.62    99.62    99.62 
## 
## Difference in P-values per method [min/max/mean %]:
## Maths : min = 0, mean = 0, max = 0
## Stoat : min = 0, mean = 0, max = 0
## RVTest : min = 0.7557, mean = 5.1593, max = 12.4004
(Figures not shown: pval_hist, pval_box, pval_hist_small, pval_diff_vs_R, pval_diff_hist, coef_hist, coef_diff_hist)

2.6 NEAR-PERFECT COLLINEARITY WITH MERGING IDENTICAL COLUMNS, NOT SIGNIFICANT

## P-value distribution [min/max/mean] : 
##   Method Min      Mean      Max
## 1     PR   0 0.1003566 0.999028
## % of p-value significant per method:
##      P_R  P_Maths  P_Stoat P_RVTest 
##    80.18    80.18    80.18    80.18 
## 
## Difference in P-values per method [min/max/mean %]:
## Maths : min = 0, mean = 0, max = 0
## Stoat : min = 0, mean = 0, max = 0
## RVTest : min = 2e-04, mean = 0.3473, max = 3.6209
(Figures not shown: pval_hist, pval_box, pval_ecdf, pval_diff_vs_R, pval_diff_hist, coef_hist, coef_diff_hist)

2.7 BETA SIMULATION